{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Goodreads \"Classics\": Authorless Topic Modeling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By [Maria Antoniak](https://maria-antoniak.github.io/) and [Melanie Walsh](https://melaniewalsh.org/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook contains some of the Python code behind our article \"The Goodreads 'Classics': A Computational Study of Readers, Amazon, and Crowdsourced Amateur Criticism.\"\n",
    "\n",
    "In this notebook, we pre-process Goodreads reviews with Laure Thompson and David Mimno's [Authorless Topic Model package](https://github.com/laurejt/authorless-tms). Then we run a topic model on the downsampled data with [Little MALLET Wrapper](https://github.com/maria-antoniak/little-mallet-wrapper) (figures 10-13)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 1: Prepare Data for Authorless Topic Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import necessary Python libraries and packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from collections import defaultdict\n",
    "\n",
    "#For topic modeling\n",
    "import little_mallet_wrapper as lmw\n",
    "\n",
    "#For data analysis and manipulation\n",
    "import pandas as pd\n",
    "import json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Establish file paths to MALLET, Goodreads reviews, book metadata, and topic model outputs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# File path to MALLET\n",
    "mallet_path = '/Volumes/Passport-1/packages/mallet-2.0.8/bin/mallet'\n",
    "\n",
    "# Directory Where All the Data Is Located\n",
    "data_directory_path = '/Volumes/Passport-1/data/goodreads/classics-shelf'\n",
    "\n",
    "# Classic Books Metadata (Pre-processed)\n",
    "processed_books_path              = data_directory_path + '/books.processed.json'\n",
    "\n",
    "# Goodreads Reviews of Classics (Pre-processed)\n",
    "processed_reviews_newest_path     = data_directory_path + '/reviews_newest.english.processed.json'\n",
    "processed_reviews_oldest_path     = data_directory_path + '/reviews_oldest.english.processed.json'\n",
    "processed_reviews_most_liked_path = data_directory_path + '/reviews_most_liked.english.processed.json'\n",
    "\n",
    "# File paths for topic model results\n",
    "topics_directory_path = '/Volumes/Passport-1/output/goodreads/classics/topics-authorless'\n",
    "training_path = topics_directory_path + '/data/train.txt'\n",
    "vocab_path    = topics_directory_path + '/data/vocab.txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a function that loads a reviews file and returns, for each non-empty review, the review text, the author of the classic book being reviewed, and the review ID."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_reviews(reviews_path, id_book_dict):\n",
    "    \n",
    "    # Use a context manager so the file is closed after loading\n",
    "    with open(reviews_path, 'r') as f:\n",
    "        reviews = json.load(f)\n",
    "    \n",
    "    texts = []\n",
    "    authors = []\n",
    "    ids = []\n",
    "\n",
    "    for r in reviews:\n",
    "        # Skip reviews that have no processed tokens\n",
    "        if r['processedTokens']:\n",
    "            texts.append(' '.join(r['processedTokens']))\n",
    "            ids.append(r['reviewID'])\n",
    "            # Collapse runs of whitespace in the author name\n",
    "            authors.append(' '.join(id_book_dict[r['bookID']]['author'].strip().split()))\n",
    "\n",
    "    return texts, authors, ids"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a function to load the books metadata"
   ]
  },
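  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_books(processed_books_path):\n",
    "    # Use a context manager so the file is closed after loading\n",
    "    with open(processed_books_path, 'r') as f:\n",
    "        books = json.load(f)\n",
    "    # Map each book ID to its metadata record\n",
    "    id_book_dict = {b['bookID']: b for b in books}\n",
    "    return id_book_dict"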
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of training documents: 128548\n"
     ]
    }
   ],
   "source": [
    "#Load books\n",
    "id_book_dict = load_books(processed_books_path)\n",
    "\n",
    "#Load reviews\n",
    "texts, authors, ids = [], [], []\n",
    "for _path in [processed_reviews_newest_path, processed_reviews_oldest_path, processed_reviews_most_liked_path]:\n",
    "    _texts, _authors, _ids = load_reviews(_path, id_book_dict)\n",
    "    texts += _texts\n",
    "    authors += _authors\n",
    "    ids += _ids\n",
    "print('Number of training documents: ' + str(len(texts)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save training data: train.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training data is a tab-separated file in which each line contains a Goodreads review ID, the author of the classic book being reviewed, and the text of the review."
   ]
  },
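  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make sure the output directory exists before writing\n",
    "os.makedirs(os.path.dirname(training_path), exist_ok=True)\n",
    "\n",
    "# Write one review per line: review ID, book author, review text (tab-separated)\n",
    "with open(training_path, 'w') as output_file:\n",
    "    for _text, _id, _author in zip(texts, ids, authors):\n",
    "        output_file.write(_id + '\\t' + _author + '\\t' + _text + '\\n')"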
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get and save vocab: vocab.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Vocab consists of every unique word across all the Goodreads reviews. We write the vocab to a file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Vocab size: 120744\n"
     ]
    }
   ],
   "source": [
    "word_count_dict = defaultdict(int)\n",
    "for t in texts:\n",
    "    for w in t.split():\n",
    "        word_count_dict[w] += 1\n",
    "vocab = list(word_count_dict.keys())\n",
    "print('Vocab size: ' + str(len(vocab)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write one vocabulary word per line\n",
    "with open(vocab_path, 'w') as output_file:\n",
    "    for w in vocab:\n",
    "        output_file.write(w + '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 2: Downsample the Training Data\n"
   ]
  },
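  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After downloading the [authorless-tms scripts](https://github.com/laurejt/authorless-tms) created by Laure Thompson and David Mimno, we run `authorless-tms/downsample_corpus.py` on the training data and vocab files saved in Part 1. The script writes the downsampled corpus to `train.authorless.txt`."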
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Note that this script should ideally be run from the command line.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python authorless-tms/downsample_corpus.py --input /Volumes/Passport-1/output/goodreads/classics/topics-authorless/data/train.txt \\\n",
    "                                            --vocab /Volumes/Passport-1/output/goodreads/classics/topics-authorless/data/vocab.txt \\\n",
    "                                            --output /Volumes/Passport-1/output/goodreads/classics/topics-authorless/data/train.authorless.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 3: Run Topic Model on Authorless Training Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Establish number of topics and file paths to topic model results and authorless training data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Number of topics to return\n",
    "num_topics = 30\n",
    "\n",
    "# File paths for topic model results\n",
    "base_data_path = '/Volumes/Passport-1/data/goodreads/classics-shelf'\n",
    "base_output_path = '/Volumes/Passport-1/output/goodreads/classics/topics-authorless'\n",
    "\n",
    "# File path to authorless training data\n",
    "authorless_data_path = base_output_path + '/data/train.authorless.txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load authorless/downsampled Goodreads review texts "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "texts = []\n",
    "with open(authorless_data_path, 'r') as f:\n",
    "    for _line in f:\n",
    "        # Skip blank lines\n",
    "        if _line.strip():\n",
    "            # Each line holds: review ID, book author, review text\n",
    "            _id, _author, _text = _line.rstrip('\\n').split('\\t')\n",
    "            texts.append(_text)\n",
    "print('Number of training documents: ' + str(len(texts)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pre-process Goodreads review texts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use Little MALLET Wrapper's `process_string` function to process the text of each Goodreads review, then remove the string \"book\" from every review."
   ]
  },
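  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Process each review with Little MALLET Wrapper\n",
    "texts = [lmw.process_string(t) for t in texts]\n",
    "# Remove the string 'book' (note: this also strips it from longer words such as 'books')\n",
    "texts = [t.replace('book', '') for t in texts]"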
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train topic model on authorless Goodreads review texts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lmw.quick_train_topic_model(mallet_path,\n",
    "                            base_output_path,\n",
    "                            num_topics,\n",
    "                            texts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Examine topic keys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 \t love character woman novel characters much marriage women young jane\n",
      "1 \t people like think things way really know one would see\n",
      "2 \t world society people human one future like new would science\n",
      "3 \t man one life human self nature yet upon even perhaps\n",
      "4 \t read story really enjoyed interesting reading would good classic found\n",
      "5 \t movie read story audio version film seen much listened reading\n",
      "6 \t spoiler would one view even much though quite hide think\n",
      "7 \t read school high reading class time first remember english year\n",
      "8 \t read stars review one reading give like would say think\n",
      "9 \t story read writing beautiful written characters one well way amazing\n",
      "10 \t like one would said could time back old eyes day\n",
      "11 \t really like didn much story liked felt read loved think\n",
      "12 \t read one time love favorite ever first still years loved\n",
      "13 \t like get know really good think one going people thing\n",
      "14 \t like didn get characters even writing could really boring much\n",
      "15 \t war history one people life great novel good many story\n",
      "16 \t love one heart never life would know could say world\n",
      "17 \t com que www https http non review una con los\n",
      "18 \t women people men woman also time white black society many\n",
      "19 \t life one story people way many death lives world live\n",
      "20 \t read reading time first years pages long started one get\n",
      "21 \t children story read little child young old adult one reading\n",
      "22 \t world story one adventure journey time first king great many\n",
      "23 \t characters one read fun great funny much play humor love\n",
      "24 \t story characters character plot end first interesting main really part\n",
      "25 \t read edition reading english translation language work version first original\n",
      "26 \t novel characters reader character narrative one also story many plot\n",
      "27 \t one death man good would play evil murder king end\n",
      "28 \t novel story one first fiction novels written classic published time\n",
      "29 \t family man life father young story mother home wife old\n"
     ]
    }
   ],
   "source": [
    "topic_keys = lmw.load_topic_keys(topics_directory_path + '/mallet.topic_keys.' + str(num_topics))\n",
    "\n",
    "for i, t in enumerate(topic_keys):\n",
    "    print(i, '\\t', ' '.join(t[:10]))"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "encoding": "# -*- coding: utf-8 -*-",
   "main_language": "python",
   "notebook_metadata_filter": "-all"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
